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Abstract 

LAPACK and ScaLAPACK are arguably the defacto standard libraries among 
the scientific community for solving linear algebra problems on sequential, shared- 
memory and distributed- memory architectures. While ease of use was a major design 
goal for the ScaLAPACK project; with respect to its predecessor LAPACK; it is still 
a non-trivial exercise to develop a new code or modify an existing LAPACK code to 
exploit processor grids, distributed- array descriptors and the associated distributed- 
memory ScaLAPACK/PBLAS routines. In this paper, we introduce what we be- 
lieve will be an invaluable development tool for the scientific code developer, which 
exploits ad-hoc polymorphism, derived-types, optional arguments, overloaded op- 
erators and conditional compilation in Fortran 95, to provide wrappers to a subset 
of common linear algebra kernels. These wrappers are introduced to facilitate the 
abstraction of low-level details which are irrelevant to the science being performed, 
such as target platform and execution model. By exploiting this high-level library 
interface, only a single code source is required with mapping onto a diverse range 
of execution models performed at link-time with no user modification. We conclude 
with a case study whereby we describe application of the LAW library in the im- 
plementation of the well-known Chebyshev Matrix Exponentiation algorithm for 
Hermitian matrices. 
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1 Introduction 



While tlie memory-wall eff'ect[l], wliicli describes the growing disparity be- 
tween CPU speed and memory latencies, has been well known in the high- 
performance computing (HPC) community for many years, only recently it 
seems has there been renewed interestO] in the disparity between the "time 
to solution" of production codes and the "tzme to solution" of the respective 
code development. 

Throughout the history of supercomputing, the progress of high performance 
architectures has been relentless, while unfortunately their companion soft- 
ware tools and environments have generally failed to keep pace. Users of such 
systems, who are frequently HPC non-specialists, are typically presented with 
ever-evolving, non-intuitive, and complex parallel-programming tools, libraries 
and development environments, which can require valuable time and prior de- 
velopment experience, before they can be effectively exploited to attack the 
computational task at hand. 

To tackle the imminent crisis in HPC productivity versus HPC peak perfor- 
mance, it is imperative that high-level, user friendly, portable library tools 
are developed which abstract away the underlying system-dependent com- 
plexity and provide the user with a simple, yet powerful interface allowing 
them to concentrate solely on solving the scientific problem at hand, without 
being distracted by lower-level architectural issues. Furthermore, the scientific 
code developer should not be inconvenienced with maintaining separate code 
branches for each target machine. In the ideal case, a single code should be 
maintained which can be mapped onto serial, shared-memory or distributed- 
memory architectures at link time with no modification required to the original 
source. 

While influential HPC libraries such as MPI, LAPACK 3.0, BIAS, ScaLA- 
PACK and PBLAS0, have succeeded in raising the abstraction bar, they can 
still arguably be considered as low-level tools, due in part to their mnemonic 
routine signatures, complicated invocation arguments and reliance on architecture- 
dependent features (for instance, during the initialization and invocation of 
message-passing frameworks), particularly by those users with limited devel- 
opment experience. While historically this development approach may have 
been unavoidable, modern programming languages possess advanced abstrac- 
tion features such as modularity and polymorphism whereby current HPC li- 
brary tool developers should be able to fully abstract away system dependent 
issues, leaving the developer to concentrate on the scientific domain details to 

^ As exemplified by the DARPA High-Productivity Computing Systems Project - 
http: / / www.darpa.mil / ipto /programs /hpcs / 

also commercial libraries such as NAG and Visual Numerics 
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which they are more comfortable and accustomed. For example, the PETSc 
library [6] is an excellent example of a tool promoting abstraction which is 
developed with modern software engineering principles, whereby users can be 
experiment freely with different solvers (both serial and parallel) in the solu- 
tion of systems modeled by partial differential equations. 

In this paper we introduce a set of high-level wrapper library routines to 
the high-performance BLAS and PBLAS libraries, which are independent of 
the target architecture, for a set of common linear algebra operations, by 
exploiting advanced features of Fortan 95 such as generic functions, derived- 
types, optional arguments, overloaded operators and source code preprocess- 
ing. By providing a single abstract user interface, developers are excused from 
considering a given architecture, whether it be serial, shared-memory multi- 
threaded or distributed-memory parallel when invoking the routines and rea- 
soning about their data storage semantics. It is believed such an environment 
will result in the rapid prototyping and development of scientific codes with 
automatic multi-architecture executables being generated during the linking 
phase, with minimal user interaction. 



2 Linecir Algebra Librciries 

LAPACK[2] and ScaLAPACK[4] and their respective lower level building block 
libraries, BLAS and BLACS/PBLAS, are arguably the defacto standard li- 
braries among the scientific community for solving linear algebra problems on 
sequential and parallel architectures. While ease of use was a major design 
goal for the ScaLAPACK continuation project; with respect to its predecessor 
LAPACK; it is still a non-trivial exercise to develop a new code or modify an 
existing LAPACK code to exploit a processor grid, distributed-array descrip- 
tors and the associated distributed-memory ScaLAPACK routines. Further- 
more, developing separate codes with individual LAPACK and ScaLAPACK 
routine instrumentation can lead to additional project management headaches 
including multiple code branches and code bloat. 

A novel approach to developing code with these high performance libraries 
would be to exploit a high-level library interface which disassociates the de- 
veloper from the underlying architecture (whether it be serial or parallel) and 
allows them to concentrate on solving the scientific problem at hand with- 
out being concerned with cryptic routine names and their type associativity, 
numerous invocation parameters and dependence on a particular architec- 
ture (which may or may not require processor grids, blocking factors and 
distributed-array descriptors). To illustrate this goal, consider the routine in- 
vocations (written in Fortran syntax) given in Example 1, which perform a 
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standard dense matrix-matrix multiply^ I operation. 



Example 1 Four Sample Dense Matrix-Multiply Routine Invocations 



1. CALL SGEMMCTRANSA, TRANSB,M,N,K, alpha, A(l, 1) ,LDA,B(1,1) ,LDB, beta, & 

C(l,l) ,LDC) 

2. CALL PSGEMM(TRANSA,TRANSB,M,N,K,alpha,A,l,l,DESCA,B,l,l,DESCB, & 

beta,C, 1, 1,DESCC) 

3(a). CALL MATRIX_MULTIPLY (A, B,C, alpha, beta) 
3(b). C = alpha*A*B + beta*C 



Invocations 1 and 2 typify the calls required to perform a matrix-matrix mul- 
tiply operatioiT^ within the sequential BLAS and parallel PBLAS libraries 
respectively. It should be remarked that the calling syntax of both these rou- 
tines somewhat betray their functionality by confusing their identity with 
oversized arities and cryptic arguments. Furthermore, invocation 2 assumes a 
distributed matrix representation with appropriately initialized array descrip- 
tors while no such information is required in invocation 1. 

Invocation 3(a) illustrates an improved calling interface whereby no assump- 
tion is made about the system-dependent storage structure of the matrices 
(i.e. is the matrix distributed or non-distributed); the developer is only con- 
cerned with supplying the appropriate values that describe the mathematical 
procedure undertaken. While it might be argued that invocations 1 and 2 pro- 
vide greater flexibility in describing the multiplication process (by allowing 
the selection of array slices and submatrices) , it can easily be shown that op- 
tional arguments provide a more modern and flexible approach to customizing 
calling method 3(a) if required. For instance, the developer can override the 
default behaviour of the matrix-multiply procedure by augmenting the calling 
signature with optional arguments as illustrated in Example 2. 



Example 2 Flexible Routine Invocation with Optional Arguments 



CALL MATRIX_MULTIPLY(A, B,C, alpha, beta, transposeA='yes' , & 
startRowB=5 , startColumnB=10 ,rowsB=20 , columnsB=30) 



In this code segment the user has expressed their wish to multiply the trans- 
pose of matrix A with a 20 by 30 submatrix defined at row 5 and column 10 

of the form C = aAB + pC 
^ For single-precision real-valued dense matrices which are not submatrices 
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in the global matrix B. While these additional arguments unavoidably incre- 
ment the arity of the calling signature, competing in size with the BLAS and 
PBLAS versions, it must be noted that optional arguments are not mandatory 
(by definition) and hence are not always required, particularly if the default 
behaviour of the matrix multiply process is satisfactory. While optional argu- 
ment interfaces are not revolutionary in the scientific programming domain 
(as exhibited by existing libraries such as those provided by NAG, Visual Nu- 
merics and LAPACK v3.0), the LAW library aims to raise the abstraction bar 
even higher by ensuring that a statement of the form described by 3(a) is fully 
sufficient and complete to be invoked on any given system, irrespective of its 
execution model, with no further modification or instrumentation. 

Furthermore, statement 3(6) in Example 1 illustrates an alternative, and ar- 
guably more natural calling interface for the matrix-multiply operation when 
the default behaviour of the multiplication is required i.e. no subarrays, slices 
or non unitary array steps are required in the definition of the matrix multipli- 
cation operatioi£3- For invocation 3(6) to be realisable, the intrinsic operators 
-|-, * and = must to be defined over the data types describing the matrix 
objects A, B and C. This overloading of intrinsic operators is fully supported 
in Fortran 95 and has been exploited in the LAW library to provide the most 
natural interfaces for operations over matrices and vectors, again irrespective 
of machine architecture and execution model. 

As motivation for the detailed description of the LAW library to follow, con- 
sider the program code given in Example 3. This code leverages the LAW 
library to perform a trivial matrix-multiply operatioiJZ] across a diverse class 
of computing architectures. In this example LAW library routine^ are in- 
voked to create memory storage for three real- valued dense matrices A, B 
and C, each with shape 100 x 100. Optional arguments accepted by the LAW 
MATRIX CREATE routine enable the matrices to be initialised to default 
values, in this case 1.0 and 2.0, for matrices A and B respectively. Finally a 
simple matrix-multiply operation is performed on matrices A and B with the 
resultant matrix stored in both the matrix C and the the conventional array 
D. 

While this task can easily be implemented using existing BLAS and PBLAS 
routines, the fiexibility of the LAW library and its higher level abstraction 
interface becomes manifest when it can be shown that this single program 
source can exploit both serial and parallel distributed-memory systems (over 
n processors) without modification. 

When compiled against the serial LAW library on a serial architecture, all 

^ in other words, multiplication is performed over the entire matrix array objects 

for dense matrices of rank 2 and size 100 
^ made accessible to the program via the use clause 
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Example 3 LAW Library Program Example 



program multiply 
use LAW 
implicit none 

type (real_2D_matrix) : : A,B,C 

real, dimensionClOO, 100) :: D 

logical : : successful 

! Initialise LAW System 
call LAW_INITIALISE() 

call LAW_CREATE_MATRIX (A, 100, 100, successful, initial Value=l) 
call LAW_CREATE_MATRIX(B, 100, 100, successful, initial Value=2) 
call LAW_CREATE_MATRIX (C , 100 , 100 , successful) 

C = A * B 
D = C 

! Terminate LAW System and Program 
call LAW_EXIT(ok) 

end program multiply 



matrices are implicitly represented using conventional serial array notation 
and serial BLAS xGEMM routines are invoked to perform the matrix mul- 
tiplication. Alternatively, if the code is compiled against the parallel LAW 
library and executed on a distributed-memory architecture, then all matrices 
are implicitly partitioned and distributed block-cyclicall}0 across the available 
user-defined processor gricf^ using user-supplied blocking factors. To perform 
the multiplication operation, a PBLAS PxGEMM routine is implicitly invoked 
with the resultant matrix stored in the distributed matrix C. Furthermore, in 
this example, the elements of the distributed matrix C are gathered and copied 
to the conventional 2D array on each no dS Such an operation can be 
very useful for collecting together the elements of a distributed matrix as a 
single entity on a single node e.g. sequential I/O from a master node. 



^ as required by ScaLAPACK's distribution scheme 

10 grid configuration is read from a simple parameter file within the LAW Library's 
Initialisation routine 

by exploiting the ScaLAPACK PxGEMR2D redistribution routine 
the parallel code is executed in single program multiple data (SPMD) fashion 
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Serial 



Multithreaded Distributed-iviemory 
Fig. 1. LAW Library Hierarchy 



Furthermore, the serial LAW hbrary could easily exploit multiple threads of a 
symmetric multiprocessing (SMP) architecture by linking to vendor-supplied 
multithreaded BLAS routines, such as those provided by Intel's Math Kernel 
Library (MKL), during the linking phase of the application build. This process 
will be discussed further in §4. 

In each case, overloaded operators and generic functions provide the power and 
flexibility to allow the LAW library routines to hide the lower-level details of 
matrix element storage, and the serial or parallel multiplication strategy. All 
that is required is for the library user to ensure that they link against the serial 
or parallel LAW library dependent on their required execution model. Overall, 
we believe that this abstraction approach is necessary, useful and achievable 
for current and future classes of HPC libraries. 

In summary Fig. 1 illustrates the hierarchical nature of the LAW library with 
respect to the lower level BLAS and PBLAS/BLACS libraries. In the remain- 
der of this paper we fully describe the design of the LAW library, and with 
the aid of a case study, illustrate its potential for the rapid prototyping and 
development of non-trivial linear-algebra based codes for execution on both 
serial and parallel architectures. It is hoped that this singular approach to 
interfacing with multi-platform high performance libraries will help promote 
abstraction within emergent scientific libraries. 
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3 The Lineeir Algebra Wrapper LibrEiry (LAW) 

As discussed in §2 the LAW hbrary has been developed to provide a singular 
interface to a subset of common operations which can be found in the high 
performance BLAS and PBLAS linear algebra kernels. The motivating char- 
acteristics of this new interface is its independence from a given architectural 
model, whether it be sequential, multithreaded or distributed-memory paral- 
lel. To achieve this goal, it is imperative that abstraction be fully utihsed in 
the design of the library's interfaces such that architecture-dependent features 
are successfully hidden from the user. 

3.1 Identifying abstraction in the BLAS and PBLAS libraries 

Routines in the BLAS and PBLAS libraries can generally be classified by the 
following three attributes: 

(1) Numerical Type i.e. real, double precision, complex or double complex 

(2) Serial or parallel execution (parallel PBLAS routine names are prefixed 
with the letter P) 

(3) Functionality 

Arguably, for most scientific users, only property (3) should be considered 
relevant in the development of a solution to a given task. In an ideal program- 
ming environment, the developer should not be concerned with data types and 
the underlying execution model. Dynamically-typed languages such as Python 
and Ruby are making significant progress in challenging traditional statically- 
typed languages such as C, C-I-+ and Java in programming pedagogy[3] as 
well as in the development of modern industrial codes due their simpler type- 
devoid programming semantics. It is the authors view that HPC can benefit 
from these modern language developments, and when coupled with efficient 
numerical libraries will make a powerful and friendly programming environ- 
ment for scientific code developers with limited programming experience (see 
[5] for more details of recent research in exploiting the dynamically typed 
Python language in HPC). 

While it is impossible to protect the developer completely from matters of 
type in a statically-typed language such as Fortran, certain language features 
exist, like generic functions, which can be exploited to provide a single unified 
interface, independent of argument types. For instance, in the matrix-multiply 
invocation 3(a) discussed in §2, the invoked routine name should be impartial 
to whether the matrices A, B and C are real-valued, complex-valued, dou- 
ble precision, dense or sparse. Ideally a single matrix-multiply routine should 
accept and manipulate input objects of varied yet sensible type. The LAW 



8 




Real Complex Real Complex 



Fig. 2. High Level LAW Library Design 

Library uses this concept of ad-hoc polymorphism to simplify the library's 
programming interface. Furthermore, as will be shown in the following sec- 
tion, distinction between sequential and parallel execution can be completely 
hidden by the LAW library by exploiting this self-same language feature. 



3.2 The LAW Library Design Model 



Fig. 2 illustrates the top-level design characterising the LAW library. The li- 
brary comprises a top-level interface which provides generic routine names for 
a common set of linear algebra operations involving matrices and vectors. This 
layer is at the highest level of abstraction and the routines at this level are 
independent of numeric type or execution model. Below this layer (and invisi- 
ble to the user) are hidden the low-level concrete details of the library. At this 
sub-level the library differentiates between serial and parallel implementations 
and definitions of the lower level routines and types respectively. When a se- 
rial build of the library is requested, the high-level interfaces are mapped onto 
the concrete routines contained in the serial branch of the library. Likewise, 
during a parallel build the high-level LAW interfaces are mapped onto the 
implementations contained in the parallel branch of the library. 

Furthermore, the serial and parallel branches are composed of three (3) main 
components: 

(1) System Setup and Initialisation Routines 



9 



LAW_INITIALISE() LAW_EXIT() 

LAW_ID() LAW_NUMBER_NODES() 
LAW_BROADCAST_DATA() LAW_RECEIVE_DATA() 
Fig. 3. Routines provided by the System sub-library 

(2) Type Definitions 

(3) Linear Algebra Routine Implementations 

Component (3) is further subdivided to contain real and complex implemen- 
tations of the linear algebra routines. 



3.3 The System Routines 



The System routines are required for initialising the LAW library's underlying 
framework for supporting either serial or parallel execution (and associated 
inter-node communication) i.e. whether it is hosting the BLAS or PBLAS 
versions of the linear algebra routines. The system sub-library consists of a 
set of constructor, accessor and mutator routines which are used to initialise, 
modify and grant access to architecture-dependent properties governed by the 
number of available processing elements, their ID's and respective topology. 
A collection of routines are also provided for collective data communication 
and barrier synchronisation among the available nodes. Fig. 3 summarises the 
generic routines available in the System sub-library. 

When the LAW library is built for a serial architecture, minimal system setup 
and initialisation is required; effectively only a single node ID attribute is as- 
signed indicating a solitary processing element. The total number of available 
nodes is also set to 1. 



When the LAW library is built for a distributed-memory parallel architecture 
a more complex initialisation is performed. In this environment, multiple pro- 
cessing elements may be available, associated under a particular topologj 
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In ScaLAPACK this topology is primarily grid-based. During initialisation, a 
grid context is created (using user supplied row and column grid parameters 
read from an external fil^llJ) and process ID's are assigned to all participating 
processors dependent on their location within the grid context. This initiali- 
sation process is implemented using the BLACS library which contains a rich 
set of routines for creating process grids and interrogating it's properties from 



where ah processors execute an identical copy of the code in a Single Program 
Multiple Data (SPMD) fashion 
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currently referred to as grid.config 
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within each processing element. 

To reinforce this implementation duality consider Examples 4 and 5 which 
illustrate the System initialisation routines for both the serial and parallel 
branches of the LAW library. First and foremost, it should be noted that both 



Example 4 LAW_INIT1AL1SE() Routine for a Serial Architecture 



subroutine LAW_INITIALISE() 



Description: 

This routine initialises the framework for a serial 
architecture 



node_ID=0 ! Set ID to in a serial implementation 

total_nodes=l ! Only one PE available 

end subroutine LAW_INITIALISE 



routines maintain a consistent high-level interface yet their implementation de- 
tails are different, as described in §3.2. In effect, this allows the LAW library 
user to invoke the initialisation routine with the same routine name irrespec- 
tive of target architecture. Only at link time is the correct implementation 
selected as will be described in §4. 

As a further example of the duality in the LAW Library System routines 
(for both serial and parallel branches) consider the LAW-BARRIER() routine 
given in Examples 6 and 7. Since synchronisation is redundant on a serial 
architecture with a single processing node, this routine performs no actioij^. 
Conversely, the parallel barrier routine invokes the lower-level BLACS barrier 
routine and synchronises all available processing nodes at that point in the 
execution. 

With these simple yet important routines the scientific code developer can 
develop a single program code which displays diverse behaviour when compiled 
and linked on different architectures. For example, consider the LAW program 
shown in Example 8. When linked against the serial LAW library, the master 
will perform some independent work and then a single salutation message will 
be displayed. In comparison, when linked against the parallel LAW library 
and executed on p processors, p salutation messages will be displayed, after 
the master process completes it's task. 



^ "smart" compilers should be able to optimise 'away' this redundant invocation 
at compile time and therefore no performance penalty is accrued 
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Example 5 LAW_INITIALISE() Routine for a Parallel Architecture 



subroutine LAW_INITIALISE() 



! Description: 

! This routine initialises the framework for a distributed 
! architecture 

! Local Scalars 
logical : : successful 

I 

call initialise_grid(successf ul) 
if ( .not . successful) then 

print *, 'Error Declaring Process Grid !' 

call LAW_EXIT(abort) 
end if 

contains 

subroutine initialise_grid (successful) 



! Description: 

! This routine initialises a process grid and corresponding 
! context 

! Local Scalars 

logical , intent (out) :: successful 

j 

call read_grid_data() ! Read Grid Extents & Blocking Factors 
call blacs_get (-1,0, context) 

call blacs_gridinit (context , 'Row' ,grid_rows ,grid_columns) 
call blacs_gridinf o( context, grid_rows,grid_coluinns, & 
process_row,process_column) 

if (process_row==-l) then ! Grid initialised unsuccessfully 

successf ul= .false . 
else 

successf ul= . true . 
end if 

call blacs_pinf o(node_id,total_nodes) ! Get Node ID & Size 
end subroutine initialise_grid 
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Example 6 LAW_BARRIER() Serial Implementation 



subroutine LAW_BARRIER() 
end subroutine LAW_BARRIER 



Example 7 LAW_BARR1ER() Parallel Implementation 

subroutine LAW_BARRIER() 

call blacs_barrier (context , 'AO 
end subroutine LAW_BARRIER 

3.4 LAW Types 



The LAW library provides the user with a rich and powerful collection of aggre- 
gate types for declaring matrix and vector entities. Currently, one-dimensional 
(ID) vectors and two-dimensional (2D) matrix objects are supported across 
serial and parallel libraries. Recent updates have also seen the inclusion of 



a sparse matrix typd I for the serial library only and hence is not avail- 
able for codes whose target platform is a distributed-memory architecture. 
Future updates to the LAW library hope to include complementary sparse 
matrix types and routines for distributed-memory machines I to preserve the 
libraries portability across execution models. 

On serial architectures, the BLAS library represents matrix and vector ob- 
jects as conventional 2D and ID array declarations respectively (where array 
elements are stored contiguously in the memory-space using row-major dis- 
tribution). On distributed-memory architectures, the memory associated with 
matrix and vector structures is partitioned and distributed across the available 
processing elements. To manage this data distribution, array descriptors are 
associated with each distributed object, which records attributes describing 
the distribution of the global structure, including the global dimensions of the 
distributed object along with the row and column blocking factors. 

When declaring distributed entities, the ScaLAPACK developer is required 



to initialise the appropriate descriptors I and ensure that satisfactory mem- 



Using the Compact Storage Row representation (CSR) 

or wrappers to existing libraries such as PETSc that can provide parallel sparse 
matrix operations 

ScaLAPACK provides routines to aid the developer in initialising these descriptors 
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Example 8 Sample LAW Code 



program hello 
use LAW 

call LAW_IMITIALISE() 

if (LAW_ID==0) then 

. . . .master does some work 
end if 

call LAW_BARRIER() 
print *, 'Hello World' 
call LAW_EXIT(ok) 
end program hello 



ory storage is allocated and available on each processing node to contain it's 
locally distributed slice of the globally-distributed structure. As stated in §3 
the primary aim of the LAW library is to abstract away these architecture- 
dependent complexities and provide a uniform view of the execution and data 
storage model. In this context it is imperative that the user not be burdened 
with managing array descriptors if the target platform is a distributed-memory 
architecture. Furthermore, serial architectures require no reference to array 
descriptors and hence their declarations are redundant in a serial execution 
model. For these reasons, it is necessary that descriptors be hidden from the 
LAW library user at the library interface level. In principle the initialisation 
of array descriptors is entirely automatic assuming the user supplies sufficient 
information regarding the processor grid and row and column blocking fac- 



tora I. This automatic generation of distributed arrays is tightly coupled to 
the creation of matrix and vector objects on distributed-memory architec- 
tures and hence can be implicit in the parallel LAW library creation routines, 
a process which will be fully described in §3.7 

Example 9 highlights the complex-valued matrix and vector derived types 
provided by the serial branch of the LAW library. The vector and non-sparse 
matrix types contain pointers to vector and matrix elements which will be 
dynamically allocated during the execution of the LAW library's matrix and 
vector constructor routines discussed in §3.7. Further attributes are provided 
in the derived types to record the extents of the objects shapj^. The sparse 



which are read from external file during parallel LAW initialisation stage 
'^^ this allows easy referencing without appealing to Fortran's size intrinsic function 
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Example 9 LAW Serial Matrix and Vector Complex Aggregate Types 



type complex_vector 

complex (kind=precision) ,dimension( : ) , pointer 

integer 
end type coniplex_vector 



elements 



length=0 



type complex_2D_matrix 

complex (kind=precision) , dimension (: , :) , pointer : : 
integer : : 

end type complex_2D_matrix 



rows=0 , columns=0 



elements 



type complex_sparse_2D_matrix 



! Represented in CSR Form 



integer ,dimension( : ) , pointer 

integer , dimension ( : ) , pointer 

complex (kind=precision) ,dimension( : ) , pointer 

integer 



rowPointers 



columnlndices 
elements 

rows=0 , columns=0 



end type complex_sparse_2D_matrix 

matrix type contains pointer attributes that reflect the compact storage row 
(CSR) nature of the sparse representation. Similarly, components are provided 
that record the global nature of the sparse representation (which are necessary 
for operations over the sparse matrix type). 

The precision kind specifier can be set at compile-time to indicate either single 
or double precision. This kind specifier is referenced throughout the LAW 
library to provide consistent yet flexible multi-precision support within a single 
code. 

Furthermore, the LAW library provides similar type deflnitions for real- valued 
matrices and vectors where the complex Fortran data type has been replaced 
with the real data type. Type names are also modifled to indicate the real 
nature of the type's attributes. 

Example 10 shows the complex-valued matrix and vector aggregate types pro- 
vided by the parallel branch of the LAW library. It should be observed that 
the only signiflcant difference between serial and parallel matrix/vector types 
is the inclusion of the descriptor pointer, which will reference the descriptor 
array associated with the distributed matrix/vector elements, as required by 
the PBLAS routines. Furthermore, it should be noted that the type names are 
identical to those provided by the serial LAW library. Again this reinforces the 
library's goal of abstraction by providing a uniform interface to the library's 
features irrespective of the implementation architecture. 
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Example 10 LAW Parallel Matrix and Vector Complex Aggregate Types 



type complex_vector 

complex (kind=precision) ,dimension( : ) , pointer 

integer , dimension ( : ) , pointer 

integer 
end type complex_vector 



elements 

descriptor 

length=0 



type complex_2D_matrix 

complex (kind=precision) , dimension (: , :) , pointer 

integer .dimension ( : ) .pointer 

integer 
end type complex_2D_matrix 



elements 
descriptor 
rows=0 , columns=0 



3.5 LAW Interfaces 



As described in §3.4, the LAW library provides a rich selection of matrix and 
vector types with contrasting implementations on serial and parallel architec- 
tures. No physical memory is associated with a type until an instance of the 
type has been declared or created. For both matrix and vector types a set 
of creation routines may be required to accommodate every combination of 
type-value and representation. For instance a LAW matrix type can be either 
real or complex- valued and with either a dense or sparse representation. This 
requires a total of four creation roTitine implementations, one for each possible 
combination of type and representation e.g. 

(1) LAW_CREATE_2D_REAL_MATRIX() 

(2) LAW_CREATE_2D_C0MPLEX_MATR1X() 

(3) LAW_CREATE_SPARSE_2D_REAL_MATR1X() 

(4) LAW_CREATE_SPARSE_2D_C0MPLEX_MATRIX() 

In this scenario, the LAW library user is provided with a flexible and powerful 
collection of creation routines, yet it can be argued such an approach can ad- 
versely affect code readability and overcomplicate the development experience 
by drowning the user in a sea of routine names. An alternative approach is 
to provide a single interface to a set of routines (with similar semantics) but 
which are distinguished by the arity and type of their input parameters. The 
number and types of the inpiit arguments to the generic interface determines 
the concrete implementation that is to be invoked. This mechanism is called 
ad-hoc polymorphism and is facilitated in Fortran 95 by generic functions. To 
illustrate the application of generic functions in the LAW library consider the 
code given in Example 11 and Example 12. 

In Example 11a single LAW_CREATE_MATR1X() generic interface is defined 
and made available to the library user for the creation of matrix objects. An 
application of this interface is given in Example 12 whereby the user has 
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Example 11 LAW Matrix Creation Interface 



interface LAW_CREATE_MATRIX 

module procedure LAW_CREATE_2D_REAL_MATRIX 
module procedure LAW_CREATE_2D_C0MPLEX_MATRIX 
module procedure LAW_CREATE_SPARSE_2D_REAL_MATRIX 
module procedure LAW_CREATE_SPARSE_2D_C0MPLEX_MATRIX 

end interface 

requested two matrices with shape (100,100) to be created. 

Example 12 LAW Matrix Creation Sample Code 

type (real_2D_matrix) : : A 

type(complex_sparse_2D_matrix) : : B 

call LAW_CREATE_MATRIX ( A ,100,100) 
call LAW_CREATE_MATRIX (B , 100 , 100) 



In both cases the LAW_CREATE_MATRIX() interface is invoked but at com- 
pile time a specific creation implementation is determined from the list (l)-(4) 
above based on matching the types of their input arguments to the argu- 
ments passed to the generic interface. For instance, the LAW_CREATE_2D_- 
REAL_MATR1X() routine expects a matrix of type reaL2D_matrix to be 
passed in its first argument while the LAW_CREATE_2D_C0MPLEX_MATR1X() 
routine expects a matrix of type complex-sparse_2D_matrix to be passed. In 
Example 12, the LAW_CREATE_2D_REAL_MATRIX() and LAW_CREATE_- 
2D_C0MPLEX_MATRIX() routines will be dispatched by the compiler to 
allocate matrices A and B respectively, via the LAW_CREATE_MATRIX() 
generic interface. Where possible, generic interfaces have been used to provide 
a single point of invocation for many of the linear algebra operations provided 
by the LAW library. 

3.6 Overloaded Operators 

In association with the generic interface facility is the ability to overload intrin- 
sic operators and create new operators within the Fortran language. Overload- 
ing operators is achieved in a similar fashion to generic functions, whereby the 
unary or binary arguments to an operator are matched to a series of routines 
associated with the overloaded operator's interface. For instance. Example 
13 illustrates the overloading of the addition operator for matrix objects 
within the LAW library In this case both LAW_C0MPLEX_MATR1X_ADD() 
and LAW_REAL_MATR1X_ADD() are subroutines expecting two matrix in- 
put arguments of the same type. Depending on the specified matrix type 
(whether real or complex), the compiler will dispatch the appropriate routine 
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Example 13 Overloaded LAW Addition Operator 



interface operator (+) 

module procedure LAW_COMPLEX_MATRIX_ADD 

module procedure LAW_REAL_MATRIX_ADD 
end interface 

implementation. This is illustrated in Example 14, where the addition of two 
real matrices using the overloaded (+) intrinsic operator is illustrated. This 

Example 14 LAW Matrix Addition Sample Code 
type (real_2D_matrix) : : A,B 

call LAW_CREATE_MATRIX ( A , 100 , 100) 
call LAW_CREATE_MATRIX (B , 100 , 100) 

A = A + B 

overloading mechanism has been exploited liberally in the LAW library to pro- 
vide more natural interfaces to operations over matrix and vector objects (and 
their combinations). As a further example the matrix- vector multiplication of 
a matrix A and vector X can be expressed in the following form: 

type (real_2D_matrix) : : A 

type (real_vector) : : X, result 

result = A * X 

In this example the overloaded (*) operator accepts a real-valued matrix and 
a real- valued vector, and performs a matrix- vector multiplication using an ap- 
propriate lower-level BLAS or PBLAS routinS Furthermore, the assignment 
operator (=) has also been overloaded within the LAW Library to accept a 
LAW vector (or matrix) and copy its attributes to another, previously declared 
vector (or matrix). 

3.7 LAW Library Routine Implementations 

While the combination of generic functions and overloaded operators provides 
a flexible and powerful interface to the LAW library, the diversity of execution 
model is administered by the low-level LAW routine implementations. Con- 
sider the creation of matrix objects discussed in §3.5. Once a specific creation 
routine has been determined via the generic interface, contrasting actions are 
required depending on whether the mode of execution is serial or parallel. 



typically xGEMV and PxGEMV 
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On a serial system conventional contiguous array storage is administered, while 
on a distributed-memory system, a block-cyclic distribution of the matrix is 
required, with the initialisation of an appropriate array descriptor. Consider 
the implementation of LAW_CREATE_2D_EIEAL_MATRIX() routine within 
the serial branch of the LAW hbrary, as illustrated in Example 15. In this im- 

Example 15 Real Matrix Creation Routine for Serial Architectures 

subrout ine LAW_CREATE_2D_REAL_MATRIX (matrix , upper 1 , upper2 , status , & 
initialValue) 



Description: 

This routine creates an instance of a 2D real-valued array 



!use LAW_Types 



! Subroutine Arguments: 
! Scalars with Intent (in) 

integer , intent (in) :: upperl ,upper2 

real (kind=precision) , optional, intent (in) :: initialValue 
! Scalars with Intent (out) 

logical, intent (out) :: status 
! Arrays with Intent (out) 

type (real_2D_matrix) , intent (out) :: matrix 
! Local Scalars: 

integer : : error_f lag 

real(kind=precision) :: value 

! 

status= . true . 
value=0_precision 



if (present (initialValue) ) value=initial Value 



allocate (matrixy,elements (1 : upperl , 1 :upper2) ,stat=error_f lag) 
if (error_f lag==0) then 

matrixyoelements=value ! Initialise array elements 

matrix%rows=upperl 

matrix°/,columns=upper2 
else 

status=. false, 
end if 



end subroutine LAW_CREATE_2D_REAL_MATRIX 



plementation, memory is dynamically allocated for the elements in the matrix 
as determined by the extents of the dimensions (labelled as upperl and up- 
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per2) passed as arguments. Furthermore, if the optional initialValue argument 
is provided, then the matrix elements are initialised to the appropriate value. 
The extents of the matrix are recorded in the rows and columns components 
of the matrix object. Finally the modifications to the matrix object are made 
visible outside the routine via the intent (out) specifier. 

Example 16 details the implementation of the LAW_CREATE_2D_REAL_- 
MATRIX() routine for the parallel branch of the LAW library. In this imple- 
mentation, a distributed array is constructed from the global matrix extents 
using appropriate routines provided in the LAW System sub-library. In this 
single-program multiple- data (SPMD) parallel execution model each proces- 
sor is assigned the responsibility of allocating sufficient local storage for its 
local assignments of the distributed array as determined by the blocking fac- 
tors and defined processor topology. The low-level BLACS library provides a 
number of inquiry routines for determining the local row and column extents 
required by a given processing element, which is encapsulated in the LAW 
library's LAW_LOCAL_MATRIX_SIZES() System routine. Once each proces- 
sor determines its local matrix extents, local storage is allocated for the matrix 
elements. 



In addition, a distributed array descriptor of fixed sizqfH is constructed for 
describing the attributes and structure of the distributed array "^^ I. Again, low- 
level PBLAS /BLACS routines are invoked with appropriate arguments to cre- 
ate the associated array descriptor via the LAW JNITIALISE_DESCRIPTOR() 
wrapper routine. Finally the matrix elements and attributes are initialised in 
the same fashion as the serial implementation. 

To further highlight the diversification of serial and parallel implementations 
for a given routine within the LAW library consider the multiplication of 



two real matrices I implemented by the routine LAW_REAL_2D_MATRIX_- 
MULTIPLY(). The implementation within the serial LAW library branch is 
given in Example 17. 

In this serial implementation, the extents of the contributing matrices are de- 
termined using the matrix's attributes contained in the derived type compo- 
nents. If the extents do not conform for multiplication then an error message 
is displayed and the code terminates in a controlled manner. If the extents 
conform then storage for the resultant matrix is created. Finally a low-level 
BLAS xGEMM routine is invoked to perform the matrix multiplication. The 
resultant matrix C is then returned to the calling routine. 

One point of note is the use of pre-processing directives within the matrix- 



currently ScaLAPACK requires a ID array with 9 elements 
as required by PBLAS routines operating on distributed arrays 

24, 



'- expressed as C = A * B 
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Example 16 Real Matrix Creation Routine for Parallel Architectures 



subroutine LAW_CREATE_2D_REAL_MATRIX (matrix , upper 1 , upper2 , status , & 
initialValue) 



Description: 

This routine creates an instance of a distributed 2D 
real-valued array 



! Subroutine Arguments: 
! Scalars with Intent (in) 
integer , intent (in) 

real (kind=precision) , optional , intent (in) 
! Scalars with Intent (out) 
logical , intent (out) 
! Arrays with Intent (out) 
type (real_2D_matrix) , intent (out) 
! Local Scalars: 
integer 
integer 
integer 

real (kind=precision) 
I 



:: upper 1 ,upper2 
: : initialValue 

: : status 

: : matrix 

error_f lag 

global_rows , global_coluTrms 
local_rows , local_columns 
value 



status=.true . 
value=0 . O_precision 

if (present (initialValue) ) value=initialValue 

global_rows=upperl ! Calculate Global Row/Column sizes 
global_columns=upper2 

! Calculate Local Row/Col Sizes 

call LAW_LDCAL_MATRIX_SIZES(global_rows, & 

global_columns , local_rows , local_columns) 



allocate (matrix°/odescriptor (9) ) 



! Create Descriptor Vector 



! Initialise Descriptor 

call LAW_INITIALISE_DESCRIPTOR(matrix%descriptor, & 

global_rows ,global_columns , local_rows) 
allocate (matrixXelements (local_rows , local_columns) , stat=error_f lag) 



if (error_f lag==ok) then 

matrix°/oelements=value ! Initialise array elements 

matrixyorows=upperl 

matrixyocolumns=upper2 
else 

status=. false, 
end if 



end subroutine LAW_CREATE_2D_REALqt|lATRIX 



Example 17 Real Matrix Multiplication for Serial Architectures 
function LAW_REAL_2D_MATRIX_MULTIPLY(A,B) result (C) 



Description: 

This routine performs a matrix-matrix multiplication C = A * B 
on serial architectures 

!use LAW_Types 

! Subroutine Arguments: 
! Arrays with Intent (in) 

type (real_2D_matrix) .intent (in) :: A 

! Arrays with Intent (out) 

type (real_2D_matrix) , intent (in) :: B 

type(real_2D_matrix) : : C 

! Local Scalars 

integer : : m,n,k 

logical : : successful 



m = A°/orows 
n = B7oColumns 
k = A°/oColumns 

if (k /= B7orows) then 

print *, 'Multiplied Matrices Do Not Conform!' 

call LAW_EXIT (abort) 
else 

call LAW_CREATE_MATRIX(C,m,n, successful) ; 

C = B 

C°/orows=m 

C°/oColumns=n 

#if (PRECISIONFLAG==SINGLE) 

call sgemmCN', 'N' , m, n, k, 1 . 0_precision, A%elements, m, 
B%elements, k, . O_precision, C%elements, m) 

#else 

call dgemmCN', 'N' , m, n, k, 1 . 0_precision, AXelements, m, 
BXelements , k, . O_precision, C%elements, m) 

#endif 
end if 

end function LAW REAL 2D MATRIX MULTIPLY 
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multiplication routine for selecting between the scheduling of either single or 
double precision BLAS routines. During the build of the LAW library, the 
PRECISIONFLAG macro variable is set by the user to indicate either single 
or double precision representation. This same mechanism is used to set the 
precision kind specifier, which requests single or double precision for the real, 
complex and integer intrinsic types, at the outset of the compilatioJ^. If 
the PRECISIONFLAG is set to SINGLE, then all single precision BLAS (or 
PBLAS routines depending on whether a serial or parallel build is requested) 
will be selected, otherwise the alternative double precision routines will be 
invoked. 

In the parallel implementation of the real matrix-multiplication routine given 
in Example 18, the only distinction is that the BLAS xGEMM routine has 
been replaced by the equivalent PBLAS PxGEMM variant. Since all the at- 
tributes describing the matrix's structure (particularly the array descriptor) 
are packaged within the matrix derived type then minimal modifications need 
to be administered to generate the parallel PBLAS implementation from the 
serial BLAS template. All routines in the LAW library follow this fundamen- 
tal structure, such that the parallel implementations can be obtained readily 
from the serial implementation. Most importantly, via generic interfaces and 
derived types, the underlying execution model is completely hidden, with no 
direct reference to array descriptors required by the library user. 

While it may be argued that this extra protective wrapper may generate ad- 
ditional invocation performance penalties, it is believed that such penalties 
are negligible and can be amortized during the execution particularly during 
intensive numerical computations where the linear algebra kernels dominate 
the wall-clock time. It is hoped that further development of the LAW library 
will help streamline the library's wrapper functions such that maximum per- 
formance can be achieved at little cost. We firmly believe though, that any 
performance penalties can be justified in light of the productivity benefits the 
LAW library can bestow to the scientific code developer and the subsequent 
rapid development of linear-algebra based codes across multiple architectures. 

Furthermore, the LAW library can be easily be extended by the library user to 
interface with additional BLAS/PBLAS routines that are currently not sup- 
ported in the current version of the library. A standard structure is obeyed in 
the design of the library, as discussed in §3.2, and new wrappers and associ- 
ated implementations can easily be inserted with minimal effort. In addition, 
due to Fortran 95 's lack of privatized derived type component J^. all matrix 
and vector type attributes can be directly accessed by the library user (when 
in scope) e.g. the user can directly reference matrix or vector elements by 



this process will be described in further detail in 
to be fixed in the new Fortran 200x release 
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Example 18 Real Matrix Multiplication for Parallel Architectures 



function LAW_REAL_2D_MATRIX_MULTIPLY(A,B) result (C) 



! Description: 

! This routine performs a matrix-matrix multiplication 
! on serial architectures 

! Subroutine Arguments: 
! Scalars with Intent (in) 
! Arrays with Intent (in) 
type (real_2D_matrix) , intent (in) 
! Arrays with Intent (out) 
type(real_2D_matrix) , intent (in) 
type (real_2D_matrix) 
integer 
logical 

external : : psgemm , pdgemm 



m = Arrows 
n = BXcolumns 
k = AyoColumns 

if (k /= B%rows) then 
if (LAW_ID()==0) then 

print *, 'Multiplied Matrices Do Not Conform!' 
end if 

call LAW_EXIT (abort) 
else 

call LAW_CREATE_MATRIX(C,m,n, successful) ; 

C = B 

C°/orows=m 

C°/oColumns=n 

#if (PRECISIONFLAG==SINGLE) 

call psgemmCN', 'N' , m, n, k, 1 .0_precision, A°/oelements,l,l, & 
A'/odescriptor , B7oelements, 1,1, B%descriptor,0.0_precision, & 
C%elements , 1,1, C°/odescriptor) 

#else 

call pdgemmCN', 'N' , m, n, k, 1 .0_precision, AZelements,!,!, & 
A'/odescriptor , B7oelements, 1,1, B%descriptor,0.0_precision, & 
C°/oelements , 1,1, CXdescriptor) 
#endif 

end if 

end function LAW_REAL_2D_MATRIX_MULTIPLY 



B 
C 

m,n,k 
successful 
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applying the derived type accessor operator %. For instance, the elements of 
a vector can be explicitly intialised or interrogated as shown in Example 19. 

Example 19 Extraction of Matrix Object Attributes 

type(real_2D_matrix) : : A 
real(kind=precision) : : element 

A°/oelements(l) = 15.5 
element = A7oelements (1) 

Note that explicit element references on a parallel system refer directly to the 
local partition of the distributed matrix on the node making the reference. 
Hence accessing the first element in the locally distributed array, will most 



likely not refer to the first element in the global matrix/vector structure 
and therefore this practice of explicit array reference should be approached 
with great care. 

Notwithstanding this warning, experienced users can leverage the accessibility 
of the derived type components to interface to computational and driver rou- 
tines in the full LAPACK and ScaLAPACK libraries. Since the problematic 
aspect of distributed array allocation is automatically performed during the 
creation of matrix and vector types, the library user can exploit the available 
type components in interfacing with additional LAPACK and ScaLAPACK 
routines above and beyond those available in the BLAS/PBLAS libraries. 

For example, a user can easily interface with the LAPACK and ScaLAPACK 
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as 



divide and conquer diagonalization routines xSYEDV and xSYEDV 
shown in Example 20 (assuming prior creation of single precision matrices 
A and Z and appropriately allocated work arrays and parameters): 

Example 20 Interfacing with ScaLAPACK routines 

call SSYEVDC JDBZ, UPLO, Arrows , Ay.elements, A%rows, W, WORK, & 
LWORK, IWORK.LIWORK, INFO ) 

call PSSYEVDC JDBZ, UPLO, A7,rows, A'/elements, 1, 1, A'/odescriptor , & 
W, zy„elements, 1, 1, Z7,descriptor , WORK, LWORK, IWORK, LIWORK, INFO) 



3.8 Array Redistribution 



On parallel systems array redistribution! I is a useful technique for converting 



a distributed array A, governed by blocking factors r and c, into distributed 



^"^ apart from processor 

which are not currently implemented in the LAW hbrary 
29 invoked with the PBLAS redistribution routine PxGEMR2D 
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array B, governed by blocking factors r' and c . A consequence of this pro- 
cess allows a conventional Fortran array D, with shape {rows, columns) to be 
converted to/from a distributed array representation by associating D with a 
row blocking factor of size rows and column blocking factor of size columns 
i.e. D of shape (rows,columns) is distributed by distributing blocks of size 
(row, column); ensuring that D is distributed with all (row, columns) elements 
assigned locally to processor 0. This property allows a distributed array to be 
collected together on processor 0, or, a global array assigned locally to proces- 
sor to be distributed across the processor grid. These properties can be every 
useful when generating data at the master node with subsequent broadcasting 
to all other nodes, or for gathering data from all nodes back to the master for 
subsequent writing to file or master node processing. 

To illustrate these techniques consider Example 21. In this program, the mas- 
ter node reads data from an external file to array D on the master node. All 
processors then participate in the copy of D to the distribiitcd matrix A via the 
overloaded assignment operator. Through the assignment interface, it can be 
determined that the copy is from the conventional array D to the distributed 
matrix A. In this instance, the selected lower-level copy implementation rou- 
tine constructs a dummy array descriptor for D treating it as a distributed 
matrix with a row blocking value of 1000 and a column blocking value of 1000 
i.e. all elements are confined to master node (processor 0). 

The copy routine then performs a redistribution to the distributed matrix A 
(whose blocking factors are read from file during the LAW library initiahsa- 
tion phase at the outset of the program). At this point various linear algebra 
operations can be applied to A before the elements in the distributed matrix 
A are gathered back to D on the master node using a further redistribution 
from A to D (along with the dummy descriptor for D as described in the 
previous assignment). The master node can now output the new elements of 
D to external file. 

When the program given in Example 21 is compiled against the serial LAW 
library no redistributions are required (since only the master node exists) and 
all matrix copies are undertaken via conventional array copies local to the 
master. 



3.9 Data Value Broadcasting 

A relatively common task in parallel processing is for the master node to 
read some parameters from an external file and then broadcast those values 
to all other worker nodes participating in the calculation. The LAW library 
provides a wrapper to the BLACS broadcast routine xGEBS2D which can be 
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Example 21 LAW Redistribution Example 



program processing 

use LAW 

implicit none 

type (real_2D_matrix) : : A 
real(kind=precision) :: data(1000 , 1000) 
logical : : successful 

call LAW.INITIALISE (A , 1000 , 1000 , successful) 

! Read data from file at master 
if (LAW_ID()==0) then 

open (123 ,file= ' old. data' ,f orm=unf ormatted) 

read (123) data 

close(123) 
end if 

call LAW_CREATE_MATRIX(A) 

A = D ! Redistribute A on Master to all Processors 

.... some parallel processing of D .... 

D = A ! Redistribute distributed array A back to D 

! Write data to file from master 
if (LAW_ID()==0) then 

open(321 ,f ile='new.data' ,form=unf ormatted) 

write (321) data 
end if 

call LAW_DESTROY_MATRIX(A) 
call LAW_EXIT(ok) 
end program processing 



exploited for this purpose. Consider the implementation of the LAW interface 
LAW_BROADCAST_DATA() given in Example 22. 



In this routine it assumed that all data parameters to be broadcast are pack- 
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Example 22 LAW Library Broadcast Routine 



subroutine LAW_REAL_DATA_BROADCAST(data) 

real (kind=precision) ,diniension( :), intent (in) :: data 
integer : : elements 

external :: SGEBS2D,DGEBS2D 

elements=size (data,dim=l) 

#if (PRECISIONFLAG==SINGLE) 

CALL SGEBS2D(LAW_C0NTEXT() , 'All', ' elements, 1, data, elements ) 
#else 

CALL DGEBS2D(LAW_C0NTEXT() , 'All', ' ', elements, 1, data, elements ) 
#endif 

end subroutine LAW_REAL_DATA_BROADCAST 



aged contiguously as real- valued I items in an array I. which is passed to the 



routine. The data packet is then broadcast to all participating nodes via the 
xGEBS2D BLAGS routine. Similarly when receiving a broadcast data packet, 
the BLAGS collective receive routine xGEBR2D is invoked by all nodes to 
receive the packet as shown in Example 23. 

The data packet is then passed back to the calling routine for unpacking and 
processing. An example code segment illustrating the broadcast of parameters 
from the master node is given in Example 24. 

In this example, the master reads two values from an external data file and 
packs them into the message buffer associated with the array data. If there is 
more than one node participating in the calculation i.e. the code is linked to the 
parallel LAW library, then the message is broadcast to all other nodes. If the 
execution model is serial then no broadcast is made and the master proceeds 
with the remaining calculation. In a parallel execution, the non-master nodes 
participate in a collective receive operation, where upon successful completion, 
all nodes have a copy of the message packet, which is then unpacked to its 
constituent parts. All nodes can then proceed with a copy of the data values 
generated at the master node. 



a complementary complex- valued routine is also available for complex-based data 
similar to MPI message-passing 
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Example 23 LAW Library Receive Routine 



subroutine LAW_REAL_DATA_RECEIVE (data) 

real(kind=precisioii) jdimensionC : ) jintentCinout) :: data 
integer : : elements 

external :: SGEBR2D , DGEBR2D 

eleinents=size (data,dim=l) 

#if (PRECISIONFLAG==SINGLE) 

CALL SGEBR2D(LAW_C0NTEXT() , 'All', ' elements, 1, data, & 
elements, 0, ) 

#else 

CALL DGEBR2D(LAW_C0NTEXT() , 'All', ' ', elements, 1, data, & 
elements, 0, ) 
#endif 

end subroutine LAW_REAL_DATA_RECEIVE 



4 Building and Linking the LAW librciry 

4-1 Building 

It has been the goal of the previous sections to motivate the quahties of the 
LAW hbrary as well as describe it design and implementation by exploit- 
ing generic functions, overloaded operators and conditional compilation. In 
particular, the pre-processing of macro variables within the LAW code is an 
invaluable mechanism in selecting between the scheduling of single or double 
precision BLAS/PBLAS routines. In addition, macro variables are also em- 
ployed at the highest level of the LAW library, to select between serial and 
parallel branches of the library during the build phase. 

As highlighted in Example 3, the LAW library is provided as a Fortran 95 
module, which can be easily imported into a user's code via Fortran's use 
clause. This controlling module uses pre-processing directives to determine the 
precision and serial/parallel execution model for the resultant library build. 
An summary snapshot of the controlling LAW module is given in Example 
25. 

It should be noted that assorted LAW library modules can be generated de- 
pending on the contents of the macro variables SYSTEM and PRECISION- 
FLAG. Currently both these variables are set in the LAW Library make- 
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Example 24 LAW Library Broadcast /Receive Sample Code 



real (kind=precision) : : data(2) 

if (LAW_ID()==0) then 

! Read Configuration Parameters when master node 
read(file,*) valuel 
read (file,*) value2 

! If more than one processor broadcast data to all nodes 
if (LAW_MUMBER_N0DES()>1) then 

! Pack data for broadcast 
data(l)=real(number_qubits) 
dat a ( 2 ) =real (numberTimeSteps 1 ) 

! Broadcast data message 
call LAW_BROADCAST_DATA(data) 

end if 

else 

! All processors receive broadcast data from the master 
call LAW_RECEIVE_DATA(data) 

! Unpack data values 
value l=int (data(l) ) 
value2=int (data(2) ) 

end if 



fil4£H, and passed to the compiler using appropriate conditional compilation 
options. With these variables the compiler can build either the serial or parallel 
branches of the LAW library module by conditionally compiling the relevant 
sections (which encapsulate the appropriate selection of generic interfaces, 
types, system and low- level implementations), as dictated by the results of 
the conditional compilation selection tests. Furthermore, the precision kind 
specifier is also set via the PRECISIONFLAG macro. In total, four combina- 
tions of execution model and type are allowable. It should also be noted that 
the low-level implementation routines for both serial and parallel branches 
are classified into either vector, matrix or matrix-vector routines and housed 
relative the top-level LAW module in individual source files. This placement 
format allows for easy identification and source-modification of the low-level 
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the user is provided with a GUI tool to register their choices 
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Example 25 LAW Library Top-Level Driver Code 



module LAW 



! Single or Double Precision 
#if (PRECISIONFLAG==SINGLE) 

integer .parameter :: precision=kind(0 . 0) ! Single Precision Kind 
#else 

integer .parameter :: precision=kind(0.dO) ! Double Precision Kind 
#endif 



! Import Data Real and Complex LAW Data Types 

#if (SYSTEM==SERIAL) 

include " . /SERIAL/REAL/LAW_TYPES . f 90" 
include " . /SERIAL/COMPLEX/LAW_TYPES . f 90" 

#else 

include " . /PARALLEL/REAL/LAW_TYPES . f 90" 
include " . /PARALLEL/COMPLEX/LAW_TYPES . f 90" 
#endif 



! Import Real and Complex LAW Interfaces 
#if (SYSTEM==SERIAL) 

include " . /SERIAL/LAW_INTERFACES . f 90" 
#else 

include " . /PARALLEL/LAW_ INTERFACES . f 90 
#endif 



contains 



#if (SYSTEM==SERIAL) 

include " . /SERIAL/LAW_SYSTEM. f 90" 

include " . /SERIAL/REAL/VECTOR/LAW_VECTOR_ROUTINES . f 90" 
include " . /SERIAL/COMPLEX/VECTDR/LAW_VECTOR_ROUTINES . f 90" 
include " . /SERIAL/REAL/MATRIX/LAW_MATRIX_ROUTINES . f 90" 
include " . /SERIAL/COMPLEX/MATRIX/LAW_MATRIX_ROUTINES . f 90" 
include " . /SERIAL/REAL/MATRIX_VECTDR/LAW_MATRIX_VECTOR_ROUTINES . f 90" 
include " . /SERIAL/COMPLEX/MATRIX_VECTOR/LAW_MATRIX_VECTOR_ROUTINES . f 90" 
#else 

include " . /PARALLEL/LAW_SYSTEM . f 90" 

include " . /PARALLEL/REAL/VECTQR/LAW_VECTOR_ROUTINES .f 90" 
include " . /PARALLEL/COMPLEX/VECTOR/LAW_VECTOR_ROUTINES . f 90" 
include " . /PARALLEL/REAL/MATRIX/LAW_MATRIX_ROUTINES . f 90" 
include " . /PARALLEL/COMPLEX/MATRIX/LAW_MATRIX_ROUTINES . f 90" 
include " . /PARALLEL/REAL/MATRIX_VECTOR/LAW_MATRIX_VECTOR_ROUTINES . f 90" 
include " . /PARALLEL/COMPLEX/MATRIX_VECTOR/LAW_MATRIX_VECTOR_ROUTINES . f 90" 
#endif 



end module LAW 
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routines when necessary. 



Finally, after a successful compilation (for choices of execution and precision 
model), the LAW library object code is packaged as a static library for eventual 
linking to a user's code. 



4.2 LAW Library Usage 



Wrapper routines in the LAW library are imported into a user's code via 
Fortran 95 's use clause. During compilation of a user's code it is imperative 
that the include modules generated during the LAW library compilation are 
available to the compiler during compilation. This can be achieved by setting 
the appropriate include library path information in the user's code build in- 
structions. Typically this is performed with the -I option on the compiler's 
command line e.g. 

$F90 -c foo.f90 -I /path_to_LAW_library/include 

Finally, both LAPACK and ScaLAPACK libraries should be linked against the 
appropriate LAW library during the linking phase. Sample linking commands 
are as follows: 

! Serial Library Linking 

$(FC) -o foo foo.f90 -Iserial.LAW -llapack -Iblas 

! Multithreaded Library Linking 

$(FC) -0 foo foo.f90 -Iserial.LAW -Imkl 

! Distributed-Memory Library Linking 

$(FC) -0 foo foo.f90 -lparallel_LAW -Iblacs -Iscalapack -Impi 

If a parallel build of the LAW library is linked then the user must supply a 
grid.config parameter file to describe the the topology of the processor grid 
and the respective row and column blocking factors. Subsequent execution 
of the parallel code is then performed using a site-specific MPI execution 
command such as mpirun or mpiexec. If a multithreaded LAPACK library 
is linked against the LAW library then the user can experiment with the 
OMP_NU MJTHREADS environment variable prior to execution to exploit 
multiple threads during their calculation. 
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Fig. 4. Structure of a 16-spin Hamiltonian matrix 



5 Case Study 

As an example of the the LAW hbrary's versatihty we study the well known 

Chebyshev matrix exponentiation routine for a Hermitian matrix. Wc aim 
to demonstrate how, with the aid of the LAW library, we can simplify the 
development of a single code source, that can be executed across a variety of 
different target architectures. 

The matrix used, which we denote H, is the Hamiltonian of a finite configu- 
ration of the Kitaev Honeycomb spin- 1/2 lattice model [7]. It is a real matrix 
but can be made complex by adding an external magnetic field along an ap- 
propriate direction. There are a number of different configurations which we 
may use with a varying number (N) of spins. The matrices scale as 2"^ x 2^ 
although they can also be given a sparse representation, see Fig. 4. The model 
is important in the investigation into topological phases of matter and their 
application to topological quantum information processing. 

The Chebyshev matrix exponentiation routine is well known [8,9]. It can be 
used to generate the actual matrix exponential but is more commonly used 

to approximate the matrix-vector product using the Chebyshev polynomial 
expansion of a sparse Hermitian matrix. The main advantage of this routine is 
that one does not need to store the (usually) non-sparse matrix exponential. 
For this example we calculate exp(rH)v for some real but negative r. The 
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approximation takes the form 

m 

exp(rH)v ^ ^ akCk{B)v, 

k=0 



where Ck is the A;*'* Chebyshev polynomial of the first kind and 

B = r-'^i (2) 



with I being the identity matrix and 
A„ — Ai An + Ai 



ii — 



k = ^^^r^ (3) 



where A^ (Ai) are the maximum (minimum) eigenvalues of the original matrix 
H. This scaling and shifting is required because the Chebyshev polynomials 
are only defined on the real axis between [-1,1] and the eigenvalues of any 
matrix to be expanded must be within this range. The coefficients are used 
to compensate and they are given by 

ao = exp(W2)i"o(Tii), Ofe = 2exp(T/2)4(T;i) (4) 



where the Ik are the modified Bessel functions. The integer m in the sum in (1) 
typically depends on r and on the accuracy we require. Note that to compute 
the exponential exp(iHT)v a similar expansion can be used but using ordinary 
Bessel functions J^. 

It should be noted that the estimate for [Ai , A„] should be from outside that 
range and that the convergence rates of the algorithm depends on its accuracy. 
This problem of obtaining a good estimate for [Ai, A„] is addressed briefly in 
[9]. 

The LAW Library code for performing the Chebyshev expansion phase of this 
algorithm is given in Example 26. 

The mathematical objects fundamental to the propagation algorithm are the 
complex vectors vl, v2, v2), vJnitial, v^end and the complex normalized Hamil- 
tonian B. Since the size of these objects are a function of the basis size, it is 
advantageous that we apply a solution that can exploit multiple processing 
elements and the large memory of a distributed-memory system if it is avail- 
able. For this reason, we model these objects as flexible LAW Library matrix 
and vector types. 
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Example 26 Chebyshev Expansion LAW Library Code 



type (complex_vector) 

type(complex_2D_matrix) 

complex (kiiid=precisioii) 

integer 

logical 



vl , v2 , v3 , v_initial , v_end 
B 

alphaO , alpha 1 , alpha_k 
kjin 

successful 



call LAW_CREATE_VECTDR(vl,numberBasisStates, successful) 
call LAW_CREATE_ VECTOR (v2,numberBasisStates, successful) 
call LAW_CREATE_VECT0R(v3,numberBasisStates, successful) 
call LAW_CREATE_VECTDR(v_initial ,numberBasisStates , successful) 
call LAW_CREATE_VECTOR(v_end,numberBasisStates , successful) 

call LAW_CREATE_2D_MATRIX (B , numberBasisStates , numberBasisStates , successful) 

! Normalise Hamiltonian 
B = (1/11)*H + (-12/11)*! 

vl = v_initial 
v2 = B * vl 

v3 = vl 
v_end = v2 

alphaO = LAW_GET_VECTOR_ELEMENT ( alphas, l)/2 
alphal = LAW_GET_VECTOR_ELEMENT( alphas, 2) 

v_end = alpha0*v3 + v_end*alphal 

! — Main for loop for implementation of Chebyshev exp(-tau*H) 
do k = 2,m 

v3 = B * v2 

vl = 2.0*v3 + (-1.0)*vl 

alpha_k = LAW_GET_VECTOR_ELEMENT( alphas, k+1) 
v_end = alpha_k*vl + v_end 

v3 = vl 

vl = v2 
v2 = v3 

end do 

v_initial = v_end 
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As discussed in §3.7 we create instances of the vector and matrix objects 
using the generic LAW Library creation routines LAW-CREATE_VECTOR() 
and LAW_CREATE_MATRIX(). For vectors the creation routine requires the 
number of elements in the vector as an argument while the matrix creation 
routine requires the number of global rows and global columns in the matrix. 
The ancillary error flag labelled successful can be interrogated after each 
creation routine to determine if the dynamic memory allocation was successful 
or not. 

As described in §3 the flexibility of the LAW library routines becomes apparent 
when you link against either the serial or parallel branches of the library. If 
the program source is linked against the parallel LAW library then all created 
vectors and matrices are distributed across a processor grid (as determined 
by the grid topology and blocking factor parameters set in the grid.config 
file). When the source is compiled and linked against the serial LAW Library 
then no data-distribution is employed and a conventional array allocation is 
performed in the single memory-space. 

Prior to propagation, we obtain the normalised Hamiltonian given in Eq.(2) 



above I. This normalisation expression introduces matrix addition via the 



overloaded LAW operators (*) and (+). In the expression 

B = (1/11)*H + (-12/11)*! 

the matrices H and / are first scaled (since multiplication has a higher prece- 
dence than addition, in operator application). In a parallel system the scaling 
is performed across all processing elements and their associated local matrix 
partitions using a lower level parallel PBLAS scaling routine for performance. 
On a serial system the scaling is performed using a conventional BLAS scaling 
routine. The resultant scaled matrices then become input to the overload ad- 
dition (-f) operator where corresponding elements in each matrix are summed 
using lower level PBLAS or BLAS vector addition routines. The derived type 
attributes of the resultant matrix are then copied across to the left-hand side 
matrix B. 



The algorithm proceeds with a series of vector assignments. The overloaded 
assignment operator (=) provided by the LAW library will ensure that the 
derived type attributes of the objects on the right of the assignment will 
be copied to the corresponding attributes on the left of the assignment, as 
discussed in §3.6. Special note should be made of the assignment to vector 
v2. Before the assignment completes, a matrix- vector multiply operation is 
performed via the overloaded product operator (*). Once again, the concrete 
matrix- vector multiply implementation is dependent on the branch of the LAW 

For the sake of clarity we will assume H, 1, 11 and 12 have been previously declared 
and initialised 
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library that is linked. If the serial library is incorporated, then the matrix- 
vector multiply is performed serially using the xGEMV BLAS routine. For 
the parallel LAW library, the product is performed in a distributed fashion 
across the processor grid using the corresponding parallel PBLAS PxGEMV 
routine. 

Once the assignments are complete, the alpha coefficients are computed as de- 
scribed in Eq.(4f^. The retrieval of a single value from a vector object can be 
accomplished with the LAW_GET_VECTOR_ELEMENT() library function. 
By supplying the global index of the required vector element as an argument 
to the routine, that element is retrieved irrespective of whether the element is 
stored locally or on a remote processing element (if the parallel LAW library 
is used). The calculation of the exact location of the global index is performed 
implicitly by the routine and the returned value is broadcast to all calling 
nodes J 



The remainder of the propagation phase is carried out using combinations of 
the LAW library routines and functions already described. It should be clear 
from this example how a single version of the Chebyshev Matrix Exponen- 
tiation algorithm can be constructed, such that will execute on both serial 
and parallel systems without any modification to the source, by exploiting the 
high-level generic interfaces provided by the LAW library. As discussed in §4, 
multithreaded BLAS routines can be used in conjunction with the serial LAW 
library to incorporate the performance of multiple threads which are available 
on Symmetric Multiprocessing (SMP) systems. 



6 Conclusions 

In this paper we have introduced the LAW Library and showed that by ex- 
ploiting advanced programming constructs in Fortran 90-|- we can implement 
a set of high-level generic interfaces that allow the development of portable 
codes based upon BLAS and PBLAS linear algebra kernels. We believe this 
approach provides a number of benefits: 

• Rapid-prototyping of scientific codes which exploit BLAS/PBLAS calls 

• Automatic handling of processor grids and array descriptors for distributed- 
memory systems 

Within the code, the vector alphas contains pre-computed coefficients where 
alphas{k) = 2 exp(rZ2)/fe(r^i) 

In practice, the alphas vector is small enough to not require distribution across 
the nodes in a parallel system, and hence does not need to be represented as a LAW 
vector object. We took the liberty to use a vector object in order to introduce the 
LAW.GET_VECTOR.ELEMENT() routine 
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• Automatic redistribution of distributed matrices an overloaded assignment 

operator 

• Only a single code source is required for diverse target architectures 

• The library is easily extensible to more linear algebra routines and opera- 
tions 

• Simpler and more natural interface to standard linear algebra kernels 

With the aid of a case study we have shown how LAW Library routines can 
simplify the development of non-trivial linear algebra calculations with a more 
natural and readable interface. Future developments will ensure that the set of 

matrix and vector operations is expanded for both serial and parallel systems 
under a single generic calling interface. This will include the provision for both 
serial and parallel sparse vector and matrix operations. 

We hope that the LAW library will be found to be an invaluable tool for par- 
allel code developers, in particular those with little or no previous experience 
with parallel communications and message-passing frame-works. 
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